This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
plot(cars)
Add a new chunk by clicking the Insert button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
RStudio shows you four panes: the Source editor, the Console, the Environment/History pane, and the Files/Plots/Packages/Help pane. You can rearrange these using Tools > Global Options > Pane Layout.
RStudio has fantastic autocomplete capabilities. To autocomplete, just press TAB. This is especially useful when loading files, as autocomplete will help you to identify the relevant ones.
At the very beginning of an R session you MUST set a ‘working directory’. This tells R where to look for and save files. An easy way to do this is from RStudio. We’re going to look at three ways to do this.
Within RStudio go to Session > Set Working Directory > Choose Directory...
You also have an option to set the working directory to the most recently-loaded .R or .Rmd file.
To do this type
setwd("path/to/directory")
Unfortunately, if you are on a Windows machine you will need to change all backslashes \ to forward slashes /. This is because R follows UNIX conventions, which are native to Linux and Mac computers.
If you are not sure what your working directory is type
getwd()
Getting the right path is a vital first step in R, and you really need to know how to do this. Here are some instructional videos if you get stuck…
For Windows computers refer to this YouTube video. For Macs refer to this YouTube video. If you’re on Linux then refer to this YouTube video
However, there is a big problem with this approach. If you send your file to someone who is working on a different machine, your path will most likely point to the wrong location.
By far the best way to set the working directory is to create an ‘R Project’ in RStudio. When you do this, the R Project stores all files and data objects which were opened in the last session, and also keeps track of where the files are stored, so there is no need to set the directory.
We can use the console for general arithmetic
1 + 3 + 5
[1] 9
(2 + 9) / 7
[1] 1.571429
We can also create variables, e.g.
x <- 10
x = 10 # Does the same thing!!
y <- 21
x*y
[1] 210
Most work in R is done using Functions. These take the following form: function(argument(s)). Here are some functions
sqrt(10)
[1] 3.162278
seq(1,10,2)
[1] 1 3 5 7 9
rep(5, 10)
[1] 5 5 5 5 5 5 5 5 5 5
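The same calls can also be written with named arguments, which makes the code self-documenting:

```r
seq(from = 1, to = 10, by = 2)  # [1] 1 3 5 7 9
rep(x = 5, times = 10)          # ten fives
```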
EX1: What do the arguments of seq and rep do? To find out more search for the relevant help file in the console by typing ?seq or by using Google.
EX2: Have a look at the following functions, gsub and grepl. What do they do? Clue: if you’re stuck, search the help file using ?
gsub("R-studio", "Rstudio", "R-studio is a great piece of software")
[1] "Rstudio is a great piece of software"
grepl("chocolate", "Mary likes chocolate cookies")
[1] TRUE
It’s possible to create your own functions. This makes R extremely powerful and extensible. We’re not going to cover writing your own functions in this course, but it’s important to be aware of this capability. There are plenty of good resources online for learning how to do this, including this one
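As a small taster, here is a minimal sketch of a user-defined function (the name square_plus_one is just for illustration):

```r
square_plus_one <- function(x) {
  x^2 + 1   # the last evaluated expression is returned
}
square_plus_one(3)  # [1] 10
```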
As we have seen above, to find out about a particular function just type ? and the name of the function into the console, e.g. ?grepl. This accesses the help files on your computer. If you’d like to search more broadly, type ??grepl, which searches the help files of all the packages installed on your computer.
Help files in R are quite densely written and not particularly aimed at beginners. Fortunately there are loads of excellent resources on the internet. Here are some really good sites:
And there are plenty more! If you find a good one share it with your colleagues via email, Twitter, or whatever social media you prefer!
To enhance the basic capabilities of R, we need to load packages/libraries. Most of the time, we download these from CRAN via Tools > Install Packages... or install.packages(). Once the package/library is installed (i.e. it is sitting somewhere on your computer), we then need to load it into the current R session using the library() function.
Remember, using a package/library is a two-stage process:

1. Install the package/library onto your computer (from the internet).
2. Load the package/library into your current session using the library command.

One of the most useful packages is called ‘tidyverse’. It contains a number of useful commands for plots and data manipulation.
Install the ‘tidyverse’ package, and then load it with the following function:
library(tidyverse)
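For completeness, here are the two steps written out in full (the install only needs an internet connection, and only needs doing once per machine):

```r
install.packages("tidyverse")  # Step 1: install onto your computer (once per machine)
library(tidyverse)             # Step 2: load into the current session (every session)
```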
I find that a particularly easy way to load packages is via the p_load function from the pacman library. We are not going to practise using it, but just to let you know that it exists!
To find out more about a package type ?package_name in the console. Alternatively you can look for the package documentation on CRAN.
Most of the functions loaded in a package should work ‘out of the box’. However, occasionally you need to refer to the package first, and then the function, using the format package_name::function_from_that_package. This is useful for a variety of reasons, for example when two loaded packages contain functions with the same name.
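For example (a sketch: stats is part of base R and dplyr is part of the tidyverse, and both define a function called filter):

```r
stats::filter(1:10, rep(1, 3))   # the moving-average filter from the stats package
dplyr::filter(mtcars, mpg > 30)  # the data-manipulation filter from dplyr
```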
EX3: Load the ggplot2 package and have a look at the function geom_point from this package. What does it do?

A variable is a type of ‘object’ which R stores in memory. R is capable of creating and storing a wide range of objects. To see what type of object we have created, we use the function class(), e.g.
class(x)
[1] "numeric"
z <- "hello"
class(z)
[1] "character"
class is one of the most useful functions in R as errors are often due to misassignment of class, e.g.
x + z
Error in x + z : non-numeric argument to binary operator
Here we have tried to add a number to a string which is clearly impossible. It’s possible to change the class of an object using commands such as as.character, as.integer, as.numeric, as.factor, e.g.
one <- "1"
x + one
Error in x + one : non-numeric argument to binary operator
one <- as.numeric(one)
x + one
[1] 11
Here is a list of the main object classes in R: numeric, character, logical, and factor. These can be combined into larger structures such as vectors, lists, matrices, and data frames.
In order to create a vector we need to use the c function. (c = ‘combine’), e.g.
list.of.numbers <- c(1,4,54,22,43,9,0,0,21)
mean(list.of.numbers)
[1] 17.11111
sd(list.of.numbers)
[1] 19.85223
a.character.vector <- c("Mary", "Jane", "Ali", "Chen")
a.list <- as.list(c(1, 2, "Mary", "Jane"))
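Note one key difference between the two: a vector can only hold a single class, while a list can mix them.

```r
mixed <- c(1, 2, "Mary")    # the numbers are coerced to strings
class(mixed)                # [1] "character"
l <- list(1, "Mary", TRUE)  # lists keep each element's own class
class(l[[1]])               # [1] "numeric"
```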
A data frame is a two-dimensional object containing variables and row numbers. It’s basically a spreadsheet.
The following code creates a data frame programmatically. It creates two variables, and combines them together to make a data frame. Note that to do this we need to use the functions as.data.frame and cbind.
list.of.movies <- c("Independence Day", "Pretty Woman", "The Godfather Part Two", "Planet of the Apes (original)")
rotten.tomatoes.variable <- c(62, 61, 97, 89)
df <- as.data.frame(cbind(list.of.movies, rotten.tomatoes.variable)) # 'cbind' binds columns together
To glimpse the top few rows type head(name_of_data_frame) in the console, e.g.
head(df)
To view the data frame in the ‘source’ window, type View(name_of_data_frame) in the console, e.g.
View(df) # NB the first letter is a capital letter.
To refer to variables, use the following syntax: data_frame_name$variable_name, e.g.
df$list.of.movies
[1] "Independence Day"             "Pretty Woman"
[3] "The Godfather Part Two"       "Planet of the Apes (original)"
When naming variables we can use dots and underscores, e.g. df$list.of.movies and df$list_of_movies. We can use numbers as long as they don’t come at the beginning, e.g. df$list_of_movies.v3.
If you use this convention, the names for variables can get very long. However, it’s generally useful, as in R you often have multiple data frames loaded into memory. Specifying both the name of the data frame and the variable avoids confusion.
Try to be consistent with your naming conventions. I tend to use underscores to name variables, e.g. data.frame.x$variable_y. This is also what Hadley Wickham recommends (Have a look at the Tidyverse Style Guide)
If you’d like to see all the variable names in a data frame type names(data_frame), e.g.
names(df)
[1] "list.of.movies" "rotten.tomatoes.variable"
Whenever you wish to access the contents of an object with multiple values (e.g. a data frame) you use indexes. These are placed inside square brackets, e.g. [1]. Have a look at the following example:
df[1,2]
[1] "62"
df[1,] # here the second number is blank
df[,2] # here the first number is blank
[1] "62" "61" "97" "89"
What does each number refer to? What happens when we leave an index blank?
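One further indexing trick worth knowing (not covered above): negative indices drop rows or columns rather than selecting them. Using the df created earlier:

```r
df[-1, ]   # all rows EXCEPT the first
df[, -2]   # all columns EXCEPT the second
```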
However, rather than use the menu, it’s much better to use actual code, as this will automate the process. Let’s import a dataset from the World Happiness Report (2017), organised by country. The files are WHR_2017.xlsx, WHR_2017.sav, and WHR_2017.csv. Alternatively, you can download the data set straight from the URL (below).
library(tidyverse)
library(haven)
df <- readxl::read_excel("WHR_2017.xlsx") # Read an excel file
df <- haven::read_spss("WHR_2017.sav") # Read from an SPSS file
df <- read_csv("WHR_2017.csv") # Read from a .csv file
#Or to download straight from the URL!!
df <- read_csv("https://verbingnouns.github.io/AdventuresInR/docs/WHR_2017.csv")
Possibly the best data format to work in is .csv (comma-separated values). This format is good because it is readable in Excel, small, simple, and not easily corrupted.
To read .csv files we use the read.csv() function from base R, or read_csv() from the tidyverse (I would go with the latter as it also shows you a list of the variable types)
We’re going to subset the WHR dataset (i.e. choose only those cases/observations which fulfil a specific criterion). To do this we’re going to use the which() function. When you apply which to a variable in a dataset, it will produce indices (indexes) of the rows which fulfil a certain criterion, e.g. which(df$var_name == 2) will give you the indices of all rows where the value of the variable is 2.
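As a toy example on a plain vector:

```r
scores <- c(5, 2, 2, 9)
which(scores == 2)  # [1] 2 3 -- the positions of the matching elements
```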
Armed with this knowledge, your task is to subset the data frame so that it only contains information from African countries.
If you’re stuck have a look at the answer below.
df.Africa <- df[which(df$region == "Africa"), ]
Okay, the above code is pretty horrible to look at, so we’re going to explore an alternative using the package dplyr which is from the tidyverse. But before we can use dplyr we have to learn how to ‘pipe’.
Pipes are written in R as %>% (note you must use a percentage sign before and after the pipe). To demonstrate what pipes do, have a look at the following pseudocode.
All pipes do is enable us to ‘pass’ a data frame (or another object) to a new function without having to keep on specifying the data frame. In addition, we can chain pipes together indefinitely.
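A minimal illustration (assuming the tidyverse, which provides %>%, is loaded):

```r
# Nested call: you have to read it inside-out
sum(sqrt(c(1, 4, 9)))            # [1] 6
# Piped version: the data flows left to right through each function
c(1, 4, 9) %>% sqrt() %>% sum()  # [1] 6
```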
Here’s how we would subset the data frame using piping:
df.Africa <- filter(df, region == "Africa") # This is the version without piping
df.Africa %>% filter(region == "Africa") -> df.Africa # This is the version with piping. It looks longer, but we can chain multiple functions together!
Note that to create a new data frame, we need a solid arrow at the end. If we don’t include that solid arrow, the results are shown in the console, but no new data frame is created. This is an incredibly useful feature of pipes. You can try before you buy!
And here is an example where we chain a series of pipes together:
df %>%
group_by(region) %>%
summarise(mean.happiness = mean(happiness_score)) ->
df.mean.happiness.by.region
`summarise()` ungrouping output (override with `.groups` argument)
NB When piping, the code becomes more readable when each line ends with the pipe.
There are a couple of important points to note:

1. To store the results we need to use a solid arrow ->. If we don’t store the results they will merely be displayed in the console.
2. Piping is a key technique in R, and once you’ve learnt it you will write much more powerful and readable code.
As well as using pipes to create data frames, you can also insert pipes into both analyses and figures! Here are some examples.
# An ANOVA without a pipe. NB we are using the base function "aov". If you would like to conduct SPSS-style ANOVAs, the best package is called "afex".
mod <- aov(happiness_rank ~ region, data = df)
pacman::p_load(broom) # To load the "tidy" function.
tidy(mod)
# Here we use a pipe inside the analysis
mod <- aov(happiness_rank ~ region, # NB note we can break the line after a comma
data = df %>% filter(region == "Africa" | region == "South America"))
tidy(mod)
g <- ggplot(aes(x = gdp_per_capita, y = happiness_score, colour = region), # NB note we can break the line after a comma
data = df %>% filter(region == "Africa" | region == "South America"))
g <- g + geom_point()
g <- g + geom_smooth(method = "lm")
g
Note how I have broken some of the lines after a comma. This makes the code more readable. Generally we can break a line when it ends in some kind of symbol, e.g. a pipe, an arrow, or a comma.
Loops and if-then statements are useful programming tools which share the same basic structure: KEYWORD(STATEMENT) {.....}.
for(i in 1:10){
print(as.character(i))
}
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "10"
To demonstrate a loop, we’re going to look at the WHR data set. We’re going to ask the question: for different regions of the world, what is the relationship between GDP per capita and happiness?
Here’s how we would do it
# This code drops regions with 3 or fewer observations (we can't compute a meaningful correlation from so few observations)
df %>%
group_by(region) %>%
summarise(num = n()) %>%
filter(num > 3) ->
df.region
`summarise()` ungrouping output (override with `.groups` argument)
# Here is the code with the loop
for (i in 1:length(df.region$region)){ # We loop through the list
df %>% filter(region == df.region$region[i]) -> temp.df # we subset the data according to the region. This contains a temporary dataset "temp.df"
model <- cor.test(temp.df$gdp_per_capita, temp.df$happiness_score) # We do the analysis
print(paste("Region: ", df.region$region[i])) # We print the results
print(model)
}
[1] "Region: Africa"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 2.6915, df = 37, p-value = 0.01062
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1021586 0.6386187
sample estimates:
cor
0.4046332
[1] "Region: Central America"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 3.9995, df = 10, p-value = 0.002521
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3828945 0.9366585
sample estimates:
cor
0.7844238
[1] "Region: Central Asia"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 2.6962, df = 12, p-value = 0.01944
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1240646 0.8634151
sample estimates:
cor
0.6142128
[1] "Region: Europe"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 8.2978, df = 40, p-value = 3.134e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6480351 0.8852640
sample estimates:
cor
0.7953214
[1] "Region: Middle East"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 5.6137, df = 15, p-value = 4.938e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5666445 0.9341741
sample estimates:
cor
0.8231111
[1] "Region: South America"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 1.3148, df = 9, p-value = 0.2211
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2614220 0.8069662
sample estimates:
cor
0.4014009
[1] "Region: South Asia"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 0.48989, df = 4, p-value = 0.6499
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7109133 0.8796331
sample estimates:
cor
0.2379102
[1] "Region: South East Asia"
Pearson's product-moment correlation
data: temp.df$gdp_per_capita and temp.df$happiness_score
t = 3.7425, df = 9, p-value = 0.004608
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3390982 0.9401079
sample estimates:
cor
0.7802562
The code below creates a sequence ranging from 0 to 30 going up in steps of 2.5. Try to achieve the same result using a loop.
seq(0,30,2.5)
[1] 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 25.0 27.5 30.0
To demonstrate if-then statements, we are going to create a new variable which shows if the happiness index is above the mean
df$happiness_above_mean <- 0 # Set variable to 0
mean_happiness <- mean(df$happiness_score) # Calculate the mean happiness score
for (i in 1:nrow(df)){
if(df$happiness_score[i] > mean_happiness){df$happiness_above_mean[i] <- 1}
}
And here is the same process using dplyr, which avoids both the loop and the if-then statement. (NB transmute keeps only the newly-created variable and drops all the others, which is why we re-read the data below. Use mutate instead if you want to keep the existing variables.)
df %>%
transmute(happiness_above_mean = as.numeric(happiness_score > mean(happiness_score))) ->
df
Note that loops and if-then statements are quite verbose, and there is almost always a neater and much shorter alternative. However, I think they are useful procedures for the relative beginner.
Here is a much easier way to create the same variable
df <- read_csv("WHR_2017.csv")
df$happiness_above_mean2 <- as.numeric(df$happiness_score > mean(df$happiness_score))
So how does this work? The statement in brackets evaluates to TRUE / FALSE. We then turn this into a number using as.numeric. TRUE evaluates to 1, while FALSE evaluates to 0.
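You can check this coercion directly in the console:

```r
as.numeric(c(TRUE, FALSE, TRUE))  # [1] 1 0 1
sum(c(TRUE, FALSE, TRUE))         # [1] 2 -- arithmetic coerces TRUE to 1 automatically
```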
It can be quite useful to chain statements. For example, if we wish to identify countries where both the happiness score and life expectancy are above the mean, we could do this….
df$happiness_and_LE_above_mean <- as.numeric((df$happiness_score > mean(df$happiness_score)) & (df$life_expectancy > mean(df$life_expectancy)))
EX6: Creating variables
Try to identify countries where both the GDP per capita and trust in the government are above the mean.
Whenever you run an analysis in R and save it to an object, the object has an internal structure. To demonstrate this, let’s do a simple regression using the WHR dataset:
df <- read.csv("WHR_2017.csv")
head(df)
Let’s draw a plot looking at the relationship between GDP per capita and Happiness Score. We’re not going to focus on the code, which will be covered in the next session.
g <- ggplot(aes(x = df$gdp_per_capita, y = df$happiness_score), data = df)
g <- g + geom_point()
g <- g + geom_smooth()
g
Now let’s run a regression
mod <- lm(happiness_score ~ gdp_per_capita, data = df) # mod = "model"
pacman::p_load(broom) # Broom is a package which produces neat tables of results
tidy(mod) # This is a broom function which tidies up the statistical results for reporting
Now, let’s have a look at the structure of this model. There are two ways to do this:

1. Use the str function, e.g. str(mod).
2. Type mod$ in the console, and then use autocomplete.

We can see that the $ symbol has a dual function in R: firstly, to specify variables within data frames, and secondly to specify subcomponents of an object.
It is useful to be able to refer to subcomponents of an object so that we can integrate them into our report, e.g. the regression yielded a value of 0.6601055
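For example, with the mod object created above (the names here are standard components of an lm object):

```r
str(mod, max.level = 1)   # list the top-level components of the model object
mod$coefficients          # the regression coefficients, accessed with $
summary(mod)$r.squared    # components can be nested: R-squared lives inside the summary
```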
If you get stuck, have a look at the first code chunk below. This contains the solution, but with pesky errors added! See if you can sort out the errors.
whr <- read_csv("WHR_2017.csv")
le <- read_csv("WHO_life_expectancy.csv")
whr %>% # NB we need to ensure that the "country" variable has exactly the same name in both datasets
rename(country = Country) ->
whr
le %>%
select(Year = 2015) %>%
merge(whr) %>%
df
plot(df$`Life expectancy`, df$GDP)
plot(df$`Life expectancy`, df$family)
The code in this chunk shows the solution!
whr <- read_csv("WHR_2017.csv")
le <- read_csv("WHO_life_expectancy.csv")
whr %>%
rename(Country = country) ->
whr
le %>%
filter(Year == 2015) %>%
merge(whr) ->
df
plot(df$`Life expectancy`, df$GDP)
plot(df$`Life expectancy`, df$family)
5.3 Comments
If you’d like to comment on any code you write (i.e. you do not wish R to try to ‘run’ this code), just add a hash (#) or series of hashes in front of it, e.g.

df <- read.csv("csv_file.csv") # This reads in the main file for the experiment